Scraping a Static Website for Data in R

David Benson

Web scraping is a technique for extracting data from a website. Some websites are easier to scrape than others: it helps when the data is stored in an HTML table, for example, and has logical CSS class and id names.

For the production of the data visualisation “Prolific Goalscorers in the English Top Flight (1988-2015)”, I scraped the goal-scoring data of 635 players over 113 years. This short guide shows the general procedure for scraping a website for data using the football data as an example.

The website for this example is located at http://free-elements.com/England/ts0.html and has a table of the top goalscorers in the English League:

When a player is clicked on, a new page loads with a table of the number of goals that player scored in each season. It is this table we want to scrape for every player: the name of each player, the number of goals they scored per season, and the season in which those goals were scored.

Doing this by hand, copying and pasting each of the 635 players into Excel, would take ages. Doing it in R, with the help of the ‘rvest’ package, should not take much time at all.

Scraping

The first thing to do is install and load the ‘rvest’ package.

install.packages("rvest")
library(rvest)

As mentioned before, it is useful if the table is an HTML table organised with class names. To find these names, inspect the table with a browser’s developer tools. In Chrome, these can be found under More tools → Developer tools. All modern browsers have developer tools or an equivalent.

To inspect the player element, click the icon in the top left of the developer tools (the square with a cursor) and select any player from the table. The developer tools will then highlight the position of this element in the HTML tree. We can see the player name “Jimmy Greaves” is in a cell with the class “NAME”. We can use this later for scraping.

The hyperlink for this player is stored in the ‘href’ attribute and is truncated to “Players/Greaves.html”. It is different for each player. We want to be directed to each of these pages and extract the resulting table. We can extract these links with the code below. The variable ‘footballers’ is created, which is simply a read of the HTML page. ‘lister’ is a list of all the ‘href’ attributes of the anchor (“a”) elements inside cells with the class “NAME” (‘.’ is used to reference a class, ‘#’ an id).

footballers <- read_html("http://free-elements.com/England/ts0.html")
lister <- footballers %>%
  html_nodes(".NAME a") %>%
  html_attr("href")
head(lister)
## [1] "Players/Greaves.html"   "Players/DeanDixie.html"
## [3] "Players/Bloomer.html"   "Players/Hodgson.html"  
## [5] "Players/Shearer.html"   "Players/Buchan.html"

The method above is all you need to extract a single static HTML table. If instead you aim to scrape data from multiple pages, it may be necessary to navigate to them iteratively in a loop and scrape each table in turn. The steps to do this are shown below with our football example.
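For the single-table case, rvest can often parse a plain `<table>` element directly into a data frame with ‘html_table’. A minimal sketch, assuming the goalscorer table is the first `<table>` on the page:

```r
library(rvest)

# Read the page and parse its first <table> straight into a data frame
# (assumes the goalscorer table is the first table on the page)
footballers <- read_html("http://free-elements.com/England/ts0.html")
top_scorers <- footballers %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(fill = TRUE)
head(top_scorers)
```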

Scraping in a Loop

The HTML address that is read into R for scraping can be altered on every iteration of a loop. We have already retrieved a list of addresses for each goalscorer, so we now just need to navigate to each page in the ‘lister’ list and scrape its table. In our example, however, the hyperlinks were truncated. A base address is therefore created, which can later be prepended to each link so that R knows the full address of each page.

base <- "http://free-elements.com/England/"
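Combining this base with a truncated link gives the full address of a player’s page. For example, with the first entry of ‘lister’:

```r
base <- "http://free-elements.com/England/"
link <- "Players/Greaves.html"  # first entry of 'lister'
full <- paste0(base, link)
full
## [1] "http://free-elements.com/England/Players/Greaves.html"
```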

We want to scrape three things. The first is the number of goals each player scored per season; these values have a class of ‘QTY’ (found using the developer tools). The second is the season in which each player scored; these have a class of ‘WH’. Finally, the name of the player can be extracted from the ‘PURPOSE’ class.
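Before writing the loop, the extraction can be tried on a single player page. A sketch, using the first link in ‘lister’:

```r
# Build the full address of the first player's page and read it once
page <- read_html(paste0(base, lister[1]))

# Pull out the three pieces of data by their class names
goals   <- page %>% html_nodes(".QTY") %>% html_text() %>% as.numeric()
seasons <- page %>% html_nodes(".WH") %>% html_text()
player  <- page %>% html_nodes(".PURPOSE") %>% html_text()
```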

The loop is shown below. The address for each player is built from the ‘lister’ variable, and this changes the read_html address on each iteration. Apart from this, the scraping procedure is the same as when we scraped the hyperlinks without a loop. The goals, years, and name from each page are scraped using the class labels. The results can be saved as a list and converted into a data frame.

(Note: to create a time series for my data visualisation, only the latter year of each season was scraped. This was done with the base ‘substr’ function; ‘str_sub’ from the ‘stringr’ package is used to trim the player names.)
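For example, given a season string of the form "1957-1958", characters 6 to 9 are the latter year:

```r
season <- "1957-1958"
# Characters 6 to 9 of "1957-1958" are "1958"
year <- as.numeric(substr(season, 6, 9))
year
## [1] 1958
```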

library(stringr)

allyears <- list()
alltogether <- list()

for (i in seq_along(lister)) {
  # Read each player's page once and reuse it for all three extractions
  page <- read_html(paste0(base, lister[i]))

  goals <- page %>%
    html_nodes(".QTY") %>%
    html_text() %>%
    as.numeric()

  allyears[[i]] <- page %>%
    html_nodes(".WH") %>%
    html_text()

  # Keep only the latter year of each season, e.g. "1957-1958" -> 1958
  year <- as.numeric(substr(allyears[[i]], 6, 9))

  name <- page %>%
    html_nodes(".PURPOSE") %>%
    html_text()
  name <- str_sub(name, -50, -12)

  alltogether[[i]] <- data.frame(name, year, goals)
}

Using the ‘plyr’ package, the results can be bound together with ‘rbind.fill’, and ‘tidyr’ can be used to remove rows that include missing data (the goals-per-club rows).

library(plyr)
all <- rbind.fill(alltogether)
library(tidyr)
scraped <- drop_na(all)
str(scraped)

Wrapping Up

We now have a dataset of 1280 observations. This is by no means the end. Another loop would need to be created to iteratively navigate to the next 5 lists of top goalscorers. The data also needs to be cleaned, as there are likely typos and other errors hidden in it. For my data visualisation, I also needed to convert the data into a time series, where each row represented a year and each column a player. For now, however, the steps shown in this guide should provide you with enough knowledge to scrape a static website. For more complicated websites, you may need to use their API instead.
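That final reshaping step is not covered here, but a sketch of the idea using ‘pivot_wider’ from ‘tidyr’, with a small illustrative data frame standing in for the real scraped data:

```r
library(tidyr)

# Toy stand-in for the scraped data frame (values are illustrative only)
scraped <- data.frame(
  name  = c("Greaves", "Greaves", "Shearer"),
  year  = c(1958, 1959, 1995),
  goals = c(22, 33, 34)
)

# Each row becomes a year and each column a player; seasons in which a
# player did not score appear as NA
series <- pivot_wider(scraped, names_from = name, values_from = goals)
series
```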